Automated Essay Scoring (AES) with R

At the core of Automated Essay Scoring (AES) are natural language processing (NLP) and machine learning. These systems analyze text against a set of predefined criteria, which may include factors such as grammar, sentence structure, vocabulary, coherence, and argumentative quality. The algorithms look for patterns and features in the text, such as the presence of thesis statements, use of evidence, and logical progression of ideas, and use them to generate a score that reflects the essay’s overall quality. In this blog post, we will explore the basics of NLP and AES with a prominent dataset: the Automated Student Assessment Prize (ASAP) dataset sponsored by the Hewlett Foundation. To see a deployed AES engine on this topic, check the “my apps” tab in the navigation bar.

Author: Ali Emre Karagül

Published: October 23, 2023


Introduction

In the realm of education and standardized testing, technological advancements have brought forth a significant transformation. Among the noteworthy innovations in this field is the adoption of Automated Essay Scoring (AES), an approach that leverages the capabilities of natural language processing (NLP) and machine learning. AES holds the potential to redefine the essay evaluation and grading process, offering efficiency, consistency, and accessibility in a way previously unattainable.

At its essence, AES relies on sophisticated algorithms to examine written text, subjecting it to a predefined set of criteria. These criteria encompass various aspects, including grammar, sentence structure, vocabulary, coherence, and the quality of argumentation. In this regard, AES algorithms function akin to meticulous digital assessors, diligently seeking out patterns and features within the text. They assess elements such as the presence of persuasive thesis statements, the adept use of supporting evidence, and the logical flow of ideas within the essay. The result of this is a numerical score that reflects the overall quality of the essay in question.

Today we will develop a linear regression model to predict essay scores using the famous ASAP dataset. There are eight different essay sets in this dataset to explore; here we will use essay set 2. Before you continue, please go and read the description and the details on Kaggle. As we will use only one essay set, I have prepared the data as a separate Excel file, and if you are already familiar with the ASAP data (or you have taken a look at the Kaggle page), please download the data for our use case from here. The packages, and some of their functions, that we will be using in this post are:

Requirements
library("rmarkdown") 
# paged_table() 
library("readxl") 
# read_excel()
library("tidyverse") 
library("textstem") 
# lemmatize_strings()
library("stopwords") 
# stopwords()
library("stringr")   
# str_squish()
library("tm")        
# removeWords()
library("quanteda")  
# nsentence() 
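
If some of these packages are not installed on your machine yet, they can all be installed in one go before loading them (a quick sketch; the package names are exactly the ones loaded above):

install.packages(c("rmarkdown", "readxl", "tidyverse", "textstem",
                   "stopwords", "stringr", "tm", "quanteda"))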

1. Preprocessing

Now that you have downloaded the data, move it to your working directory and follow along while reading. Let’s load the data and take a look at the first rows:

essay_set_2 <- read_excel("essay_set_2.xlsx") 
paged_table(head(essay_set_2, 2)) # paged_table() gives a prettier table view on this page; you don't need it when following along on your own.

The nine columns are:

  • essay_id: a unique identifier for the record.

  • essay_set: That is set to 2 for all records in this dataset as we are working on essay set 2. We will get rid of this soon.

  • essay: The text of the essay written by a real person.

  • There are also six more columns: rater1_domain1, rater2_domain1, domain1_score, rater1_domain2, rater2_domain2, and domain2_score (the two raters’ scores and the resolved score for each of the two scoring domains).

Let’s look at a randomly selected sample essay:

set.seed(1234)
random_essay <- sample(1:nrow(essay_set_2), 1) # sample a row index; nrow() (not length()) gives the number of essays
paged_table(essay_set_2[random_essay, 3]) # column 3 is the essay text

We will use the essay and rater1_domain1 columns for our analysis, as we are working purely for demonstration purposes (of course, in a real-life research setting there are many other things you would have to take into consideration). The essay column is the text of the essay written by a real person, and the rater1_domain1 column is the score given to the essay by the first rater. We will rename these columns as response and score respectively. We will also create a new column called doc_id, which will be a unique identifier for each essay in our case study.

set <- essay_set_2 %>% 
  select(essay, rater1_domain1) %>% 
  rename(response = essay, score = rater1_domain1)
set$doc_id <- paste0("doc", 1:nrow(set))

When you scrutinize the essays, you will see that there are some @words in the text. These are placeholders for private information about the students, such as names, places, or dates. There are also non-alphabetic characters and capital letters. To make the essays ready for analysis, we will convert all the text to lowercase and remove the non-alphabetic characters. We will also remove the stopwords and lemmatize the text.

set$processedResponse <- gsub("@\\w+ *", "", set$response) #remove @words
set$processedResponse <- gsub("[^a-zA-Z]", " ", set$processedResponse) #remove non-alphabetic characters
set$processedResponse <- tolower(set$processedResponse) #convert to lowercase
set$processedResponse <- lemmatize_strings(set$processedResponse) #lemmatize
en_stopwords <- stopwords::stopwords("en", source = "stopwords-iso") #get stopwords
set$processedResponse <- removeWords(set$processedResponse, words = en_stopwords) #remove stopwords
set$processedResponse <- str_squish(set$processedResponse) #remove extra whitespaces

Here is how the newly processed text looks (the same essay we saw above):

sample_essay <- merge(set[random_essay, 4], set[random_essay, 1]) # column 4 is processedResponse, column 1 is the original response
paged_table(sample_essay)

2. Feature Extraction

Feature extraction is a critical component of Automated Essay Scoring (AES), serving as the foundation upon which the system’s evaluation process is built. In AES, feature extraction procedures involve the identification and analysis of various linguistic and structural elements within the essay. These elements may include but are not limited to word frequency, sentence length, syntactic complexity, vocabulary richness, and the presence of specific content-related markers like thesis statements or evidence citations. The goal of feature extraction is to distill the complex nature of written language into a set of quantifiable, machine-readable attributes that can be used by the AES algorithms to assess the quality and coherence of the essay. Through a combination of natural language processing and statistical analysis, feature extraction empowers AES systems to objectively evaluate essays, providing educators and test administrators with consistent, efficient, and data-driven grading outcomes.

For the sake of simplicity of demonstration, we will extract the following features: number of sentences, number of paragraphs and number of contextual words.

set$n_sentence <- nsentence(set$response) # number of sentences in the original text
set$n_paragraph <- str_count(set$response, "     ") + 1 # paragraphs approximated by counting the multi-space separators in the raw text
set$n_contextWords <- lengths(strsplit(set$processedResponse, ' ')) # number of remaining (contextual) words after preprocessing
head(set)
# A tibble: 6 × 7
  response  score doc_id processedResponse n_sentence n_paragraph n_contextWords
  <chr>     <dbl> <chr>  <chr>                  <int>       <dbl>          <int>
1 Certain …     4 doc1   material remove …         20           4            126
2 Write a …     1 doc2   write persuasive…          3           4             33
3 Do you t…     2 doc3   library remove m…         15           3             59
4 In @DATE…     4 doc4   offensive opinio…         31           5            111
5 In life …     4 doc5   life offensive s…         35          11            131
6 A lot of…     4 doc6   lot censor conte…         25           5             79

As stated before, many other features could be extracted from the essays. We will continue our study with these three features, even though they would not be sufficient for real-life research; a short sketch of two further illustrative features follows below.
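
To give an idea of what such features might look like, here is a brief illustrative sketch of two extra ones that could be computed with the packages already loaded (these are my own assumptions for demonstration and are not used in the models below): a type-token ratio as a rough proxy for vocabulary richness, and the average number of words per sentence.

toks <- tokens(set$processedResponse) # tokenize the processed text with quanteda
set$ttr <- ntype(toks) / ntoken(toks) # type-token ratio: unique words / total words
set$avg_sentence_len <- lengths(strsplit(str_squish(set$response), " ")) / set$n_sentence # average words per sentence in the raw text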

Before we start building a machine learning model, we need to divide our dataset into training and test sets.

dataset <- set
train_indices <- sample(nrow(dataset), nrow(dataset) * 0.7)  # 70% for training
train_data <- dataset[train_indices, ]
test_data <- dataset[-train_indices, ]

3. Linear Regression Model (LR)

What is Linear Regression?

Linear regression is a statistical model that examines the linear relationship between two (simple linear regression) or more (multiple linear regression) variables: a dependent variable and one or more independent variables. A linear relationship basically means that when the independent variable(s) increase (or decrease), the dependent variable increases (or decreases) too. The variable being predicted (the one the equation solves for) is called the dependent variable, and the variables used to predict it are called the independent variables. The results of an LR model can be evaluated with statistical metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared.

In our case, the dependent variable is the score and the independent variables are n_sentence, n_paragraph and n_contextWords.
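
To make this concrete, Model 1 below assumes a relationship of the form

score ≈ b0 + b1 * n_sentence + b2 * n_paragraph + b3 * n_contextWords

where lm() estimates the coefficients b0 through b3 by ordinary least squares (the b labels are just illustrative names, not objects in the code).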

Linear Model 1
formula <- as.formula("score ~ n_sentence + n_paragraph + n_contextWords")
model_1 <- lm(formula, data = train_data)
# Print the summary of the model
summary(model_1)

Call:
lm(formula = formula, data = train_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.52359 -0.36412  0.00356  0.40190  1.93435 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2.1864591  0.0418320  52.268  < 2e-16 ***
n_sentence     0.0224310  0.0027105   8.276 3.23e-16 ***
n_paragraph    0.0023062  0.0034157   0.675      0.5    
n_contextWords 0.0076319  0.0005822  13.109  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5595 on 1256 degrees of freedom
Multiple R-squared:  0.4701,    Adjusted R-squared:  0.4688 
F-statistic: 371.4 on 3 and 1256 DF,  p-value: < 2.2e-16

Performance of Model 1

Let’s investigate the output from the model one by one.

  1. Residuals: This section provides statistics about the residuals, which are the differences between the observed values and the predicted values by the model.

    • Min: The minimum residual value is -2.52359.

    • 1Q: The first quartile (25th percentile) of the residuals is -0.36412.

    • Median: The median of the residuals is 0.00356.

    • 3Q: The third quartile (75th percentile) of the residuals is 0.40190.

    • Max: The maximum residual value is 1.93435.

  2. Coefficients: This section provides information about the coefficients of the linear regression model. Each row corresponds to a predictor variable (independent variable) in the model.

  3. Estimate: This is the estimated coefficient for each predictor. Std. Error: the standard error of the coefficient estimate. t value: a measure of how many standard errors the coefficient estimate is away from zero. Pr(>|t|): the p-value associated with the t-value, which tells you whether the coefficient is statistically significant. In our model we have three predictor variables: n_sentence, n_paragraph, and n_contextWords. The Estimate column gives the estimated coefficient for each of these predictors, and the *** symbols mark the ones that are highly statistically significant. In other words, the number of paragraphs is not a significant predictor for the model, while the number of sentences and the number of contextual words are highly significant. (These values can also be pulled out programmatically; see the short snippet after this list.)

  4. Multiple R-squared: This is a measure of how well the model fits the data. It tells you the proportion of the variance in the dependent variable that is explained by the independent variables. In our case, R-squared is approximately 0.4701, which means that about 47% of the variance in the dependent variable is explained by our predictors.

  5. Adjusted R-squared: This is a version of R-squared that adjusts for the number of predictors in the model. It is a slightly more conservative measure of goodness of fit.

  6. F-statistic: This is a measure of the overall significance of the model. It tests whether at least one of the predictor variables is significantly related to the dependent variable. A high F-statistic and a very low p-value (which is the case here) indicate that the model is significant.

  7. p-value: The p-value associated with the F-statistic. In this case, it’s extremely small, indicating that the overall model is highly significant.
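
For completeness, here is a minimal sketch showing how the quantities discussed above can be pulled out of the fitted model with standard base-R accessors (output not shown):

coef(summary(model_1)) # matrix of estimates, standard errors, t values and p-values
confint(model_1) # 95% confidence intervals for the coefficients
summary(model_1)$r.squared # multiple R-squared
summary(model_1)$adj.r.squared # adjusted R-squared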

Comparison of Model 1 and Model 2

Let’s build another model without the feature n_paragraph and compare the results.

Linear Model 2
formula_2 <- as.formula("score ~ n_sentence + n_contextWords")
model_2 <- lm(formula_2, data = train_data)
# Print the summary of the model
summary(model_2)

Call:
lm(formula = formula_2, data = train_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.51924 -0.36470  0.00476  0.39940  1.93587 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2.195033   0.039849  55.083   <2e-16 ***
n_sentence     0.022599   0.002698   8.375   <2e-16 ***
n_contextWords 0.007637   0.000582  13.123   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5594 on 1257 degrees of freedom
Multiple R-squared:  0.4699,    Adjusted R-squared:  0.4691 
F-statistic: 557.1 on 2 and 1257 DF,  p-value: < 2.2e-16

Residual standard error: The residual standard error is very close in both models, indicating that they have similar predictive accuracy. Multiple R-squared: The R-squared values are also very close; Model 2 has a slightly lower R-squared, but the difference is minimal. F-statistic: Model 2 has a higher F-statistic than Model 1, largely because the non-significant n_paragraph term has been dropped, so the explained variance is spread over fewer predictors.

In summary, both models are quite similar in terms of their coefficient estimates and predictive accuracy, and Model 2 has a slightly higher F-statistic. Personally, I agree with this outcome: the number of paragraphs should not be a predictor when estimating the score, while the number of sentences and the number of contextual words are much more important.
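
If you would like a formal check that dropping n_paragraph does not cost explanatory power, one standard option is a nested-model F test with anova() (a quick sketch; output not shown here):

anova(model_2, model_1) # compares the reduced model with the full model; a large p-value supports dropping n_paragraph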

Evaluation of Model 2 with the Test Dataset

As we would like to continue with the second model, we can use predict() to make predictions on the test data. The predict() function takes the model and the new data set as arguments and returns a vector of predictions, which we will save in a new column of the test data set to make comparisons in the next stage. Now, let’s use the mean() and sqrt() functions to calculate the MSE and RMSE of the model on the test data. We can also use the summary() function to report the R-squared value of the model (note that this R-squared comes from the training fit, not from the test set).

# Make predictions on the test data
predictions <- predict(model_2, newdata = test_data)
# Evaluate the model
mse <- mean((test_data$score - predictions)^2)
rmse <- sqrt(mse)
r_squared <- summary(model_2)$r.squared
# Printing evaluation metrics beautifully :D
cat(paste0("Mean Squared Error (MSE): ", sprintf("%.2f", mse), "\n",
            "Root Mean Squared Error (RMSE): ", sprintf("%.2f", rmse), "\n",
            "R-squared (R²): ", sprintf("%.2f", r_squared), "\n"))
Mean Squared Error (MSE): 0.30
Root Mean Squared Error (RMSE): 0.54
R-squared (R²): 0.47
  • MSE (Mean Squared Error): MSE is a measure of the average squared difference between the observed (actual) values and the predicted values by the model. It’s a measure of the model’s accuracy, with lower values indicating a better fit.

  • RMSE (Root Mean Squared Error): RMSE is the square root of the MSE and provides a measure of the average error in the same units as the dependent variable. It’s a commonly used metric to quantify the prediction error of the model.

  • R-squared (R²): R-squared is a measure of how well the independent variables explain the variability in the dependent variable. It ranges from 0 to 1, with higher values indicating a better fit. In the output, the R-squared is approximately 0.4699, which means that about 47% of the variance in the dependent variable is explained by the independent variables in the model. Remember that this value comes from the training fit; a test-set version can be computed directly, as shown in the sketch after this list.
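
For reference, here is a minimal sketch of how a test-set R-squared could be computed directly from the predictions, instead of reporting the training value from summary() (output not shown):

ss_res <- sum((test_data$score - predictions)^2) # residual sum of squares on the test set
ss_tot <- sum((test_data$score - mean(test_data$score))^2) # total sum of squares on the test set
test_r_squared <- 1 - ss_res / ss_tot
test_r_squared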

Predictions of the Test Dataset

Let’s take a look at the predictions of the test data. We can use the head() function to print the first 10 predictions.

# Add the predictions to the test data, rounded to the nearest whole score point:
test_data$predictions <- round(predictions)
head(test_data[,c("score","predictions")], 10)
# A tibble: 10 × 2
   score predictions
   <dbl>       <dbl>
 1     4           4
 2     1           3
 3     2           3
 4     4           4
 5     5           4
 6     2           4
 7     4           3
 8     3           3
 9     3           3
10     3           2

But looking at the raw data alone is never enough to draw insights. Let’s also print the confusion matrix to get a better overall picture. The confusion matrix is a table often used to describe the performance of a classification model on a set of test data for which the true values are known. It shows the ways in which the model gets confused when it makes predictions, giving insight not only into the errors being made but, more importantly, into the types of errors being made. In the table below, the rows are the predicted scores and the columns are the actual scores given by the rater.

# Create the confusion matrix and print it
confusion_matrix <- table(test_data$predictions, test_data$score)
confusion_matrix
   
      1   2   3   4   5   6
  2   7  13   1   0   0   0
  3   2  35 186  85   1   0
  4   0   1  30 148  14   0
  5   0   0   0   7   8   0
  6   0   0   0   0   1   1

In this context, the confusion matrix helps us understand the model’s performance in grading essays. It indicates which score points are frequently confused with others, providing insight into where the model might need improvement. For instance, score points 3 and 4 appear to be frequently confused with each other. Yet I believe that is not a big deal: confusing a high score such as 6 with a low score such as 2 would be a catastrophe in an AES context, but here we do not have even one such case. Besides the confusion matrix, one can calculate performance metrics such as accuracy, precision, recall, and F1-score to get a more comprehensive assessment of the model’s grading performance.
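
As a rough sketch of what such metrics could look like here (this is illustrative and not part of the original analysis), accuracy and per-score precision and recall can be derived from a squared-up version of the confusion matrix:

# Align predicted and actual scores on the same levels so the matrix is square
levels_all <- sort(union(test_data$score, test_data$predictions))
pred_f <- factor(test_data$predictions, levels = levels_all)
actual_f <- factor(test_data$score, levels = levels_all)
cm <- table(Predicted = pred_f, Actual = actual_f)
accuracy <- sum(diag(cm)) / sum(cm) # proportion of essays scored exactly like the rater
precision <- diag(cm) / rowSums(cm) # per score point; NaN where a score was never predicted
recall <- diag(cm) / colSums(cm) # per score point; NaN where a score never occurs in the test set
accuracy; precision; recall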

Conclusion

Our exploration into AES has shed light on the potential of this technology for transforming the evaluation of written content. Our linear regression model, built on a deliberately limited set of extracted features (the number of contextual words, sentences, and paragraphs), still explained roughly half of the variance in the rater’s scores. It is evident that by harnessing machine learning algorithms, we can achieve consistent and objective grading, streamlining the assessment process for educators and administrators.

However, it’s important to note that Linear Regression is just one of the many models available for AES. The field of automated essay scoring continues to evolve, and researchers are exploring a range of models and techniques to enhance accuracy and broaden the scope of assessment. Some alternatives to LR include Support Vector Machines (SVM), Random Forests, Neural Networks, and Natural Language Processing models like Recurrent Neural Networks (RNNs) or Transformers, such as BERT and GPT-3. These models bring their unique strengths and capabilities to the table, offering a diverse array of tools for essay evaluation.

As AES advances, the synergy of these models with increasingly sophisticated feature extraction procedures promises to further elevate the quality and reliability of automated essay scoring. With this, we look to a future where technology and human expertise collaborate seamlessly to offer more efficient, accurate, and comprehensive assessment in the realm of written expression.